Overview

This document is intended for Kunpeng application developers. Based on the features of the Kunpeng 920 processor, it describes how to optimize code performance during Kunpeng application development, at both the hardware and software levels.

The Kunpeng processor is an ARM-based enterprise-level processor. In the general-purpose computing field, Huawei released its first ARM-based 64-bit processor, the Kunpeng 912, in 2014. The Kunpeng 916, released in 2016, is the industry's first ARM-based processor that supports multi-socket interconnection. The third-generation Kunpeng 920, released in January 2019, is the industry's first data-center-class ARM-based processor built on the 7 nm process.

Figure 1 shows the evolution roadmap of the Huawei Kunpeng processor family.

Figure 1 Kunpeng processor roadmap

This document describes the Kunpeng programming optimization methods in terms of hardware and software. Figure 2 shows the programming optimization framework.

Figure 2 Programming optimization framework

At the hardware level, the optimization and tuning methods are described based on the features of the Kunpeng processor.

The Kunpeng 920 processor uses the non-uniform memory access (NUMA) architecture, which overcomes the limit that symmetric multiprocessing (SMP) places on the number of CPU cores. A high core count is therefore one of the major advantages of the Kunpeng processor. However, when programs run concurrently, memory accesses that cross NUMA nodes degrade program performance. Multi-core and NUMA describes programming methods for improving the NUMA affinity of programs.
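As a rough illustration of NUMA affinity, the following C sketch pins the calling thread to one CPU and then allocates and writes a buffer from that thread. It assumes Linux with glibc, where pages are typically placed on the NUMA node of the CPU that first writes them. The function names pin_to_first_allowed_cpu and alloc_and_touch are hypothetical and are not taken from this document. If the affinity API is unavailable, the sketch simply runs unpinned.

```c
#define _GNU_SOURCE /* exposes sched_getaffinity, sched_setaffinity, CPU_* */
#include <assert.h>
#include <sched.h>
#include <stddef.h>
#include <stdlib.h>

/* Pin the calling thread to the first CPU it is currently allowed to run on.
 * Returns the CPU number on success, or -1 if pinning is unavailable or fails. */
int pin_to_first_allowed_cpu(void)
{
#if defined(__linux__) && defined(CPU_SET)
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            /* pid 0 means "the calling thread" */
            if (sched_setaffinity(0, sizeof(one), &one) != 0)
                return -1;
            return cpu;
        }
    }
    return -1;
#else
    return -1; /* affinity API not available: run unpinned */
#endif
}

/* Allocate n doubles and write every element from the calling thread.
 * Writing the data here, after pinning, is what makes the pages "first
 * touched" by the thread that will later use them. */
double *alloc_and_touch(size_t n, double init)
{
    assert(n > 0);
    double *buf = malloc(n * sizeof *buf);
    if (buf == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        buf[i] = init;
    return buf;
}
```

A worker thread would call pin_to_first_allowed_cpu first and alloc_and_touch second, so that the data it processes is initialized by the same thread that later reads it. In a real program each worker would be pinned to a different CPU chosen by the application.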

The SoC of the Kunpeng 920 processor uses cores developed by Huawei HiSilicon. For a single processor core, instruction pipelining, which overlaps the execution of successive instructions in time, plays a leading role in performance improvement. The Kunpeng 920 processor supports multi-level pipelines. Pipeline describes how to arrange instructions to improve pipeline throughput and efficiency and make full use of the processor. Cache and Prefetch, based on the L3 cache of the Kunpeng 920 processor, introduces ways to arrange and access program data structures that can significantly improve the cache hit ratio.
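The effect of access order on the cache can be sketched with a small C example. Both functions below compute the same sum over a matrix stored in row-major order, but they traverse it differently. The function names are hypothetical, and no specific speedup is claimed here; the actual difference depends on the matrix size and the cache hierarchy.

```c
#include <assert.h>
#include <stddef.h>

/* Sum a rows x cols matrix stored in row-major order.
 * The inner loop walks consecutive addresses, so every element of a fetched
 * cache line is used before the next line is needed. */
double sum_row_major(const double *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    double total = 0.0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            total += m[r * cols + c];
    return total;
}

/* Same result, but the inner loop jumps cols elements between accesses.
 * For large matrices each access tends to land on a different cache line,
 * which lowers the cache hit ratio. */
double sum_column_major(const double *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    double total = 0.0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            total += m[r * cols + c];
    return total;
}
```

When the traversal order of the algorithm cannot be changed, the alternative is to change the data layout so that elements accessed together are stored together.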

At the software level, instruction sets, compilers, and acceleration libraries are introduced.

The Kunpeng 920 processor is fully compatible with the Armv8-A instruction set. For details about how to use its NEON (Advanced SIMD) instructions, see SIMD Programming. After code development is complete, the compiler translates the code into executable files for the target platform. Compiler describes how to use compiler features to optimize code.
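As a minimal sketch of what NEON programming looks like, the following C function adds two float arrays four elements at a time by using the standard intrinsics from arm_neon.h (vld1q_f32, vaddq_f32, and vst1q_f32). The function name add_f32 is hypothetical. The NEON path is compiled only when the compiler defines __ARM_NEON; on other targets the scalar loop handles all elements.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Element-wise addition: dst[i] = a[i] + b[i] for i in [0, n). */
void add_f32(float *dst, const float *a, const float *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    size_t i = 0;
#if defined(__ARM_NEON)
    /* Process four floats per iteration in 128-bit NEON registers. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));
    }
#endif
    /* Scalar tail, or the whole array on targets without NEON. */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Keeping a scalar tail loop means the function accepts any length, not only multiples of four. A compiler with auto-vectorization enabled may generate similar code from the scalar loop alone.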

In addition to the preceding common methods, Huawei provides the Kunpeng BoostKit, a full-stack optimization solution for Kunpeng hardware, base software, and application software. It provides high-performance open source components, basic acceleration software packages, and application acceleration software packages to enable optimal application performance. Kunpeng BoostKit Library describes how to use the Kunpeng basic acceleration libraries.